This notebook aims at working with tables (data.frames, tibbles, data.tables, …) using dplyr or any other query language (as provided, for example, by data.table).

We investigate life tables describing countries from Western Europe (France, Great Britain –actually England and Wales–, Italy, the Netherlands, Spain, and Sweden) and the United States.
We load the one-year life tables for the female, male, and whole populations of the different countries.
The meaning of the different columns:
mx: Central death rate between ages x and x+n where n=1, 4, 5, or ∞ (open age interval)
qx: Probability of death between ages x and x+n
ax: Average length of survival between ages x and x+n for persons dying in the interval
lx: Number of survivors at exact age x, assuming l(0) = 100,000
dx: Number of deaths between ages x and x+n
ex: Life expectancy at exact age x (in years)
But some of the columns need retyping:
Year: should be integer
Age: needs some cleaning; after cleaning it should be typed as integer
Lx: should be integer
Tx: should be integer
[…] numeric)

[…] LIFE_TABLES directory. Henceforth, the universal table is named life_table; its schema is the following.

| Column Name | Column Type |
|---|---|
| Year | integer |
| Age | integer |
| mx | double |
| qx | double |
| ax | double |
| lx | integer |
| dx | integer |
| Lx | integer |
| Tx | integer |
| ex | double |
| Country | factor |
| Gender | factor |
Coercion introduces a substantial number of NA warnings. Preliminary inspection of the data suggests that the coercion problems originate from column Age: the value 110+ cannot be coerced to an integer. We discard the corresponding rows using tidyr::drop_na(Age).
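The retyping and cleaning step can be sketched as follows. This is a minimal illustration on a toy table, not the notebook's original chunk: the tibble `raw` and its values are made up, and only `tidyr::drop_na(Age)` is taken from the text.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for a freshly read life table (hypothetical values):
# every column arrives as character because Age contains "110+".
raw <- tibble(
  Year = c("1948", "1948", "1948"),
  Age  = c("0", "1", "110+"),
  mx   = c("0.05", "0.004", "0.7")
)

cleaned <- raw %>%
  mutate(
    Year = as.integer(Year),
    # "110+" cannot be parsed as an integer and becomes NA;
    # suppressWarnings() silences the expected coercion warning.
    Age  = suppressWarnings(as.integer(Age)),
    mx   = as.numeric(mx)
  ) %>%
  drop_na(Age)   # discard the open age interval rows
```

After this step `Age` is a proper integer column, at the cost of losing the open interval 110+.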
We notice that death rates for newborns are much higher in Italy and Spain than in the other European countries and in the USA. This difference is still noticeable for infant mortality. For adults, however, death rates are much the same in all countries. The difference in mortality among the young could be explained by the different economic and health conditions in these countries at that time.
We can see that the ratio between the central death rate in the Netherlands and the central death rate in the USA is less than 1, which means that in 1948 the central death rate was lower in the Netherlands than in the USA. This ratio is greater than 1 for almost all the other European countries, however: in 1948 the central death rate was higher in most European countries than in the USA, especially in Italy. This difference could be explained by the health conditions and the financial situation of the two continents at that time.
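One way to obtain such ratios is to join each country's rates with the USA rates for the same Year, Age and Gender. The sketch below uses a made-up toy table (the values are illustrative, not taken from the data), and the column name `mx_usa` is an assumption introduced for the example.

```r
library(dplyr)

# Hypothetical toy extract for 1948; values are illustrative only.
toy <- tibble(
  Country = c("USA", "USA", "Netherlands", "Netherlands", "Italy", "Italy"),
  Year    = 1948L,
  Age     = c(0L, 1L, 0L, 1L, 0L, 1L),
  Gender  = "Both",
  mx      = c(0.030, 0.0020, 0.027, 0.0018, 0.060, 0.0050)
)

# Reference rates: one row per (Year, Age, Gender) for the USA.
usa <- toy %>%
  filter(Country == "USA") %>%
  select(Year, Age, Gender, mx_usa = mx)

# Ratio of each European country's rate to the USA rate.
ratios <- toy %>%
  filter(Country != "USA") %>%
  inner_join(usa, by = c("Year", "Age", "Gender")) %>%
  mutate(ratio = mx / mx_usa)
```

A ratio below 1 means the country's central death rate is lower than the USA's at that age.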
[…] mx) for both genders as a function of Age for years 1946, 1956, … up to 2016.

We notice that the mortality quotients of young people in 1946 are smaller in the USA than in all the European countries. This is certainly due to the fact that the USA, unlike the European countries, did not suffer heavy human losses during WWII.
We modify our dataframe so it has the following schema:
| Column Name | Column Type |
|---|---|
| Year | integer |
| Age | integer |
| mx | double |
| mx.ref_year | double |
| Country | factor |
| Gender | factor |
where (Country, Year, Age, Gender) serves as a primary key, mx denotes the central death rate at Age for Year and Gender in Country, and mx.ref_year denotes the central death rate at Age for the argument reference_year in Country for Gender.
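A table with this schema can be built with a self-join on the reference year. The sketch below is an assumption about how this might be done, not the notebook's original code; the function name `add_ref_year` and the toy values are hypothetical.

```r
library(dplyr)

# Hypothetical helper: attach the rate of `reference_year` to every row.
add_ref_year <- function(df, reference_year) {
  ref <- df %>%
    filter(Year == reference_year) %>%
    select(Country, Gender, Age, mx.ref_year = mx)

  df %>%
    inner_join(ref, by = c("Country", "Gender", "Age")) %>%
    select(Year, Age, mx, mx.ref_year, Country, Gender)
}

# Toy data (illustrative values only).
toy <- tibble(
  Year    = c(1946L, 1956L, 1946L, 1956L),
  Age     = c(0L, 0L, 1L, 1L),
  mx      = c(0.04, 0.02, 0.004, 0.001),
  Country = factor("USA"),
  Gender  = factor("Female")
)

with_ref <- add_ref_year(toy, 1946L)
```

The ratio studied in the text is then simply `mx / mx.ref_year`, computed row by row.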
[…] Spain, Italy, France, England & Wales, USA, Sweden, Netherlands.

But as we have done since the beginning, we concentrate on the comparison between the USA and the Netherlands.
In the USA, the ratio of the mortality rate in each year after 1946 to the rate in 1946 has remained below 1 at all ages, which means that mortality at every age has been lower than in 1946 ever since. In the Netherlands, this ratio has generally been higher than in the USA, especially at the older ages. We also notice a difference for newborns between the two countries: in 1956 the ratio is twice as high in the USA as in the Netherlands, which means that newborn mortality in 1946 was much higher in the Netherlands relative to later years, whereas the gap between 1946 and later years is smaller in the USA. Since 2006, the ratio at age 25 has been higher in the USA than in the Netherlands.
Newborns have higher mortality than children aged 1 and 5, for both genders and in all countries, and children aged 5 have the lowest mortality. We do not notice any difference between the mortality quotients of the two genders in any country. For the three ages, we can see noticeable peaks for the Netherlands corresponding to the two world wars; these peaks do not appear on the US plot. Also, for both the USA and our European country, the mortality quotients were clearly higher 100 years ago than they are nowadays.
[…] Gender and Country:

We observe that the mortality quotients for men are higher than for women at ages 15 to 60 in both countries. The peaks for the Netherlands are still observable, and even sharper, for the same years (WWI and WWII). As before, for both the USA and our European country, the mortality quotients were clearly higher 100 years ago than they are nowadays.
From life_table, we then compute another data frame called life_table_pivot with primary key Country, Gender and Year, and with one column for each Age from 0 up to 110. For each age column, the entry is the central death rate at the age given by the column name, for the Country, Gender and Year identifying the row.

The resulting schema looks like:
| Column Name | Type |
|---|---|
| Country | factor |
| Gender | factor |
| Year | integer |
| 0 | double |
| 1 | double |
| 2 | double |
| 3 | double |
| \(\vdots\) | \(\vdots\) |
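A wide table of this shape can be obtained with `tidyr::pivot_wider`. The following is a minimal sketch on a toy table with three ages and illustrative values; the real life_table has many more columns, which are dropped here before pivoting.

```r
library(dplyr)
library(tidyr)

# Toy long-format table (illustrative values only).
toy <- tibble(
  Country = factor("USA"),
  Gender  = factor("Female"),
  Year    = rep(c(1948L, 1949L), each = 3),
  Age     = rep(0:2, times = 2),
  mx      = c(0.030, 0.0020, 0.0012, 0.029, 0.0019, 0.0011),
  qx      = c(0.029, 0.0020, 0.0012, 0.028, 0.0019, 0.0011)
)

life_table_pivot <- toy %>%
  # Keep only the key columns and the value to spread;
  # any other column would otherwise become part of the row key.
  select(Country, Gender, Year, Age, mx) %>%
  pivot_wider(names_from = Age, values_from = mx)
```

Each row is now identified by (Country, Gender, Year), and the age columns are named `0`, `1`, `2`, so they must be quoted with backticks when referenced.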
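\[ e_x = \sum_{k \geq 1} \prod_{j=0}^{k-1} \left(1 - m_{x+j}\right) \]

Here \(m_{x+j}\) denotes the central death rate (column mx) at age \(x+j\), used in place of the one-year death probability, and \(e_x\) is the life expectancy at age \(x\) (column ex).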
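One reading of the formula above is a sum, over future years, of the probability of still being alive, with each one-year survival probability approximated by one minus the central death rate. Under that assumption it reduces to a cumulative product followed by a sum; the function name `life_expectancy` is hypothetical.

```r
# m: vector of central death rates at ages x, x+1, x+2, ...
# Assumption: 1 - m approximates the one-year survival probability.
life_expectancy <- function(m) {
  survival <- cumprod(1 - m)  # probability of surviving 1, 2, 3, ... more years
  sum(survival)               # expected number of complete years lived
}

# Tiny worked example: survive each of two years with probability 1/2,
# then die for sure in the third year.
e_toy <- life_expectancy(c(0.5, 0.5, 1))
# survival = 0.5, 0.25, 0  ->  e_toy = 0.75
```

Applied to one row of life_table_pivot, starting from the column of the age of interest, this gives the residual life expectancy at that age for the corresponding Country, Gender and Year.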
[…] Year at ages \(60\) and \(65\), facetted by Gender and Country:

[…] 1948:2010. Then we extract the corresponding rows from life_table_pivot, taking logarithms of the central death rates. Once this is done, we perform principal component analysis:

Our scree plot displays how much variation each principal component captures from the data. Since the scree plot is a steep curve that bends quickly and flattens out, the first two PCs are sufficient to describe the essence of the data. So we can say that PCA works well on our data.
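The PCA step and the quantities behind a scree plot can be sketched with base R's `prcomp`. The matrix below is simulated (a common downward trend plus noise), standing in for the log central death rates of one population over 1948:2010; it is not the notebook's data, and only ten ages are used to keep the example small.

```r
set.seed(1)

years <- 1948:2010
ages  <- 0:9

# Simulated log-rates: one shared time trend, age-specific sensitivity, small noise.
trend       <- seq(0, -1, length.out = length(years))
sensitivity <- seq(1, 0.5, length.out = length(ages))
noise       <- matrix(rnorm(length(years) * length(ages), sd = 0.02),
                      nrow = length(years))
log_mx <- outer(trend, sensitivity) + noise
dimnames(log_mx) <- list(years, ages)

# Rows are years (individuals), columns are ages (variables).
pca <- prcomp(log_mx, center = TRUE, scale. = FALSE)

# Share of variance captured by each component: the heights of a scree plot.
explained <- pca$sdev^2 / sum(pca$sdev^2)
```

A steep drop in `explained` after the first one or two entries is what justifies keeping only the first components; `pca$rotation` holds the loadings drawn on the correlation circle and `pca$x` the year coordinates shown on the biplot.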
We see on the correlation circle that infant mortality is inversely correlated with life expectancy. Indeed, on the left side of the circle, the arrows for the advanced ages point down whereas those for the younger ages point up, and the mx arrow points to the right side of the circle. We must keep in mind, however, that the oldest ages represent a small percentage of the total population.
We see that the recent years lie mostly on the right side of the biplot, which means that they follow the direction of mx on the correlation circle. The PCA thus allows us to conclude that life expectancy increases as time goes by.
## [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## [19] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
## [37] 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
## [55] 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
## [73] 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
## [91] 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107
## [109] 108 109
## <0 rows> (or 0-length row.names)
## A1 Moisture Management Use Manure
## 1 2.8 1 SF Haypastu 4
## 2 3.5 1 BF Haypastu 2
## 3 4.3 2 SF Haypastu 4
## 4 4.2 2 SF Haypastu 4
## 5 6.3 1 HF Hayfield 2
## 6 4.3 1 HF Haypastu 2
## 7 2.8 1 HF Pasture 3
## 8 4.2 5 HF Pasture 3
## 9 3.7 4 HF Hayfield 1
## 10 3.3 2 BF Hayfield 1
## 11 3.5 1 BF Pasture 1
## 12 5.8 4 SF Haypastu 2
## 13 6.0 5 SF Haypastu 3
## 14 9.3 5 NM Pasture 0
## 15 11.5 5 NM Haypastu 0
## 16 5.7 5 SF Pasture 3
## 17 4.0 2 NM Hayfield 0
## 18 4.6 1 NM Hayfield 0
## 19 3.7 5 NM Hayfield 0
## 20 3.5 5 NM Hayfield 0
During the last century, in the USA and in western Europe, central death rates at all ages have exhibited a general decreasing trend. This decreasing trend has not always been homogeneous across ages.
The Lee-Carter model has been designed to model and forecast the evolution of the log-central death rates for the United States during the XXth century.
Let \(A_{x,t}\) denote the log central death rate at age \(x\) during year \(t\in T\) for a given population (defined by Gender and Country).
The Lee-Carter model assumes that the observed logarithmic central death rates are sampled according to the following model \[ A_{x,t} \sim_{\text{independent}} a_x + b_x \kappa_t + \epsilon_{x,t} \] where \((a_x)_x, (b_x)_x\) and \((\kappa_t)_t\) are unknown vectors that satisfy \[ a_x = \frac{1}{|T|}\sum_{t \in T} A_{x,t}\qquad \sum_{t\in T} \kappa_t = 0 \qquad \sum_{x} b_x^2 =1 \] and \(\epsilon_{x,t}\) are i.i.d. Gaussian random variables.
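Under the constraints above, a least-squares fit can be sketched with a rank-one SVD of the row-centered matrix. This is a minimal illustration, not necessarily the fitting procedure used in the notebook; the function name `fit_lee_carter` and the sign convention are assumptions.

```r
# A: matrix of log central death rates, rows = ages x, columns = years t.
fit_lee_carter <- function(A) {
  a <- rowMeans(A)                 # a_x: average log-rate over years
  Z <- sweep(A, 1, a)              # subtract a_x from each row
  s <- svd(Z, nu = 1, nv = 1)      # leading singular triple
  b     <- s$u[, 1]                # unit norm, so sum(b^2) = 1
  kappa <- s$d[1] * s$v[, 1]       # sums to zero because rows of Z are centered
  # Sign is arbitrary in an SVD; assumed convention: sum(b) >= 0.
  if (sum(b) < 0) {
    b     <- -b
    kappa <- -kappa
  }
  list(a = a, b = b, kappa = kappa)
}

# Noise-free toy example built to satisfy the model exactly.
a_true     <- c(-3, -5, -2)
b_true     <- c(0.6, 0.8, 0)       # sum of squares is 1
kappa_true <- c(2, 1, 0, -1, -2)   # sums to 0
A_toy <- a_true + outer(b_true, kappa_true)

fit <- fit_lee_carter(A_toy)
```

On real data the rank-one term only approximates the centered matrix, and the fitted `kappa` series is what would be extrapolated to forecast future log-rates.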
[…] 1933 up to 1995.